Morphological analysis for less-resourced languages: Maximum Affix Overlap applied to Zulu

Authors

  • Uwe Quasthoff
  • Sonja Bosch
  • Dirk Goldhahn
Abstract

The paper describes a collaborative approach, still in progress, to morphological analysis of less-resourced languages. The approach combines two components: first, a language-independent machine learning algorithm, Maximum Affix Overlap, that generates candidate morphological decompositions from an initial set of language-specific training data; and second, language-dependent post-processing using language-specific patterns. In this paper, the Maximum Affix Overlap algorithm is applied to Zulu, a morphologically complex Bantu language. The algorithm is expected to work for other Bantu languages and possibly for other language families as well. With limited training data and a ranking adapted to the language family, the effort required for manual verification can be reduced substantially. The machine-generated list is verified manually via a web frontend.

Related resources

Using Resource-Rich Languages to Improve Morphological Analysis of Under-Resourced Languages

The world-wide proliferation of digital communications has created the need for language and speech processing systems for under-resourced languages. Developing such systems is challenging if only small data sets are available, and the problem is exacerbated for languages with highly productive morphology. However, many under-resourced languages are spoken in multi-lingual environments together ...

Ukwabelana - An open-source morphological Zulu corpus

Zulu is an indigenous language of South Africa, and one of the eleven official languages of that country. It is spoken by about 11 million speakers. Although it is similar in size to some Western languages, e.g. Swedish, it is considerably under-resourced. This paper presents a new open-source morphological corpus for Zulu named Ukwabelana corpus. We describe the agglutinating morphology of Zul...

Semi-automated extraction of morphological grammars for Nguni with special reference to Southern Ndebele

A finite-state morphological grammar for Southern Ndebele, a seriously under-resourced language, has been semi-automatically obtained from a general Nguni morphological analyser, which was bootstrapped from a mature hand-written morphological analyser for Zulu. The results for Southern Ndebele morphological analysis, using the Nguni analyser, are surprisingly good, showing that the Nguni langua...

Exploiting Cross-Linguistic Similarities in Zulu and Xhosa Computational Morphology

This paper investigates the possibilities that cross-linguistic similarities and dissimilarities between related languages offer in terms of bootstrapping a morphological analyser. In this case an existing Zulu morphological analyser prototype (ZulMorph) serves as basis for a Xhosa analyser. The investigation is structured around the morphotactics and the morphophonological alternations of the ...

Machine learning for the analysis of morphologically complex languages

This thesis demonstrates that machine learning can be applied in different ways to automate the analysis of morphologically complex agglutinating languages. Firstly, the target language Zulu, an under-resourced indigenous language of South Africa, is characterised before presenting the UKWABELANA CORPUS. The morphological Zulu corpus has been semiautomatically compiled in close cooperation with...

Publication date: 2014